Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges:

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Notebook Index:

Phase 1

Phase 2

Preparation of Test Data

Kaggle Submission Phase1

Kaggle Submission Phase2

Writeup Phase1

Writeup Phase2

Phase 1

Kaggle API setup

Kaggle is a data science competition platform that hosts a lot of datasets. In the past, it was troublesome to submit your results, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it took me less than 15 minutes to finish a submission.

  1. Install library

For more detailed information on setting up the Kaggle API see here and here.
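
A hedged sketch of the typical setup, assuming pip is available and an API token has been downloaded from your Kaggle account page (file names and paths are the Kaggle CLI's documented defaults):

```shell
# Install the official Kaggle CLI (assumes Python/pip are available)
pip install kaggle

# Place the API token downloaded from kaggle.com (Account -> Create New API Token)
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Verify access by listing the competition files
kaggle competitions files home-credit-default-risk

# Download the data into the project's data directory
kaggle competitions download -c home-credit-default-risk -p ../../../Data/home-credit-default-risk
```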

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either be unable to obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

image.png

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Imports

Data files overview

Data Dictionary

A data dictionary comes as part of the data download. It is named HomeCredit_columns_description.csv.

image.png

Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

Summary of Application train

Data Description

Correlation with the Target Variable

As we can see above, we have computed the correlation of the various independent features in the dataset with the dependent (target) variable. Very few columns are correlated with the target variable, which makes the prediction task all the more difficult. Let's see how we proceed from here!

Missing Values Analysis

Missingno is a Python library that helps in understanding the distribution of missing values through informative visualizations, such as heat maps or bar charts. With this library, it is possible to observe where the missing values occur and to check the correlation of the columns containing missing values with the target column.
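
Missingno only visualizes; the underlying per-column missing counts can be computed with pandas alone. A small sketch on a toy frame (column names are illustrative stand-ins for application_train):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for application_train (column names are illustrative)
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000, np.nan, 250000, 180000],
    "EXT_SOURCE_1": [np.nan, np.nan, 0.5, 0.7],
    "CODE_GENDER": ["F", "M", "F", None],
})

# Missing count and percentage per column, sorted worst-first
missing = (df.isnull().sum().to_frame("n_missing")
             .assign(pct_missing=lambda t: 100 * t["n_missing"] / len(df))
             .sort_values("pct_missing", ascending=False))
print(missing)
```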

The above graph shows the number of missing values per column.

Visualization Through Correlation Matrix

A Correlation Matrix is used to examine the relationship between multiple variables at the same time. When we do this calculation we get a table containing the correlation coefficients between each variable and the others. These coefficients show us both the strength of the relationship and its direction (positive or negative). In Python, a correlation matrix can be created using packages such as Pandas and NumPy.

Dropping columns with high correlation
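
One common way to implement this step, sketched with pandas/NumPy and an assumed 0.9 threshold (the notebook's actual threshold may differ): keep the upper triangle of the absolute correlation matrix so each pair is checked once, then drop one column from every highly correlated pair.

```python
import pandas as pd
import numpy as np

def drop_high_corr(df, threshold=0.9):
    """Drop one column from every numeric pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# a and b are nearly identical; c is unrelated noise
df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [1.1, 2.0, 3.2, 3.9, 5.1],
                   "c": [5, 1, 4, 2, 3]})
reduced, dropped = drop_high_corr(df)
```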

Conclusions of Correlation Matrix

Visual Exploratory Data Analysis

Idea of Target Variables(Count)

People who repaid the loan outnumber the defaulters by almost 10 to 1.

Visualization Through Suite Type with respect to Target variable

Visualization Through Distribution of Applicant's Family Members Count with respect to Target variable

Correlation of columns with respect to Target Variable

As we can see, EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 are the most negatively correlated with the target.

Visualization Through Correlation with TARGET Variable

Above, we have plotted the correlation of all the features with the target variable. As can be seen, approximately 30% of the features are not correlated with the target variable at all. We can drop them, as we do not believe they will influence our predictions of the target variable!

Removing more columns on the basis of correlation with target variable.

Top 5 Correlated features with Target Variable

Conclusion from the Correlation Matrix

After dropping the unimportant or unrelated columns, we have re-plotted the correlation with the TARGET variable. This looks much better than before: the remaining variables are all reasonably correlated with the TARGET variable!

Visualization Through Correlation Matrix

The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

We see that the dataset contains a mix of float, int and object data types.

Who's the highest borrower

Insights:

There are more females than males in the dataset. One might conclude that females take out more loans than males, but it could be misleading to assume that from counts alone!

Who's the highest borrower

Conclusions:

In this plot we have examined whether or not a person (man or woman) defaults on a loan. Although the count of females is higher than the count of males, the graph says something different: the number of females owning a car is higher than the number of males owning one, and percentage-wise they default less.

Insights:

It appears that females have much more difficulty in getting loans compared to males.

Insights:

It appears that cash loans make up the majority of defaulted loans.

Insights:

It appears that there is not much difference in the percent of defaults with respect to the property owned.

Separating categorical and numerical data for visualization


Who are major borrowers and what are their occupations

How economically stable are clients? Who are the most and least stable?

Value Counts in Categorical Features

Which category of occupations repays on time, making better clients for the company to lend money to?

Preprocessing

Data preprocessing is an integral step in machine learning, as the quality of the data and the useful information that can be derived from it directly affect the ability of our model to learn; therefore, it is extremely important that we preprocess our data before feeding it into our model. This section covers the following concepts:

Importing the necessary modules

Splitting the data into train and test Data

We are going to split the data into train and test Data so that we can perform a check on our model afterwards.
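
A minimal sketch of this split using scikit-learn's train_test_split on a toy frame (column names and sizes are illustrative); stratifying on the target preserves the roughly 10:1 class ratio in both subsets, which matters for this imbalanced dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for application_train; TARGET is the label column
df = pd.DataFrame({"AMT_CREDIT": range(100),
                   "TARGET": [0] * 90 + [1] * 10})

X = df.drop(columns=["TARGET"])
y = df["TARGET"]

# Stratify on y so the class ratio is the same in the train and validation parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```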

Separating numerical and categorical variables to fit in the pipeline

Build processing pipelines

In this part of the project the focus is on constructing the pipeline. Since the data has both numerical and categorical features, we create two pipelines (one for each category of data) because they require different transformations. The two pipelines are then unified into one full pipeline that transforms the whole dataset.
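
A minimal sketch of such a two-branch pipeline with scikit-learn, assuming the median/most-frequent imputation and one-hot encoding described in the write-up (column names and toy values are illustrative):

```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]
cat_cols = ["CODE_GENDER"]

# Numeric branch: median imputation then standardization
num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scale", StandardScaler())])
# Categorical branch: most-frequent imputation then one-hot encoding
cat_pipe = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                     ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Unify both branches into one full preprocessing pipeline
full_pipe = ColumnTransformer([("num", num_pipe, num_cols),
                               ("cat", cat_pipe, cat_cols)])

df = pd.DataFrame({"AMT_INCOME_TOTAL": [100000, np.nan, 250000],
                   "AMT_CREDIT": [400000, 600000, np.nan],
                   "CODE_GENDER": ["F", np.nan, "M"]})
Xt = full_pipe.fit_transform(df)  # 2 scaled numeric cols + 2 one-hot gender cols
```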

Modeling and Feature Engineering

Now that we have explored the data, cleaned it, preprocessed it and added a new feature to it, we can start the modeling part of the project by applying Machine Learning algorithms. In this section, you will have a baseline logistic regression model and grid searches on different models. In the end, you will find out which parameters are the best for each algorithm and you will be able to compare the performance of the models with the baseline model.

Baseline Logistic Regression

Logistic regression (LR) is a statistical method similar to linear regression, since LR finds an equation that predicts an outcome for a binary variable, Y, from one or more response variables, X. However, unlike linear regression, the response variables can be categorical or continuous, as the model does not strictly require continuous data. To predict group membership, LR uses the log odds ratio rather than probabilities and an iterative maximum-likelihood method rather than least squares to fit the final model. This means the researcher has more freedom when using LR, and the method may be more appropriate for non-normally distributed data or when the samples have unequal covariance matrices. Logistic regression assumes independence among the variables, which is not always met in real datasets.

Fitting and Storing Results

Prediction probability of Class 0 and Class 1

Validation Accuracy

Log Loss

ROC-AUC Score

ROC Curve

Confusion Matrix

Baseline Random Forest

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions by aggregating the predictions of the trees created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Steps 1 & 2 until N trees have been built.

Step-5: For a new data point, take the majority vote of the N trees' predictions.
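
In scikit-learn these steps are handled internally by RandomForestClassifier (a bootstrap sample per tree, N = n_estimators, and majority voting / probability averaging). A minimal sketch on synthetic stand-in data; in the project, X and y come from the preprocessing pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (the real features come from the pipeline)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# n_estimators is the "N" above; each tree is fit on a bootstrap sample
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)
proba = rf.predict_proba(X)  # column 1 is P(class 1), used for the ROC-AUC score
```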

Fitting and Storing Results

Validation Accuracy

Log Loss

ROC-AUC Score

ROC Curve

Confusion Matrix

Phase 2

Working on Bureau data

Merging of Bureau with Bureau Balance

on SK_ID_BUREAU

Making a helper dataframe which contains only the target and SK_ID_CURR, so that we can check the correlation of the merged bureau columns with the target and drop irrelevant columns.
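
The merge step above can be sketched with pandas on toy stand-ins for bureau and bureau_balance (values are illustrative; the real tables have many more columns). The monthly bureau_balance table is first aggregated to one row per SK_ID_BUREAU, then left-joined onto bureau:

```python
import pandas as pd

# Toy stand-ins for bureau and bureau_balance
bureau = pd.DataFrame({"SK_ID_BUREAU": [1, 2, 3],
                       "SK_ID_CURR": [100, 100, 200],
                       "AMT_CREDIT_SUM": [1000.0, 2000.0, 1500.0]})
bureau_balance = pd.DataFrame({"SK_ID_BUREAU": [1, 1, 2, 3],
                               "MONTHS_BALANCE": [0, -1, 0, -2]})

# Aggregate the monthly table to one row per bureau loan, then merge
bb_agg = bureau_balance.groupby("SK_ID_BUREAU", as_index=False).mean()
bureau_merged = bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")
```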

EDA and preprocessing on Merged Bureau Data

Merging of processed Bureau Data with Application Train Data

Joining on SK_ID_CURR

Storing the data to a CSV file so that if the kernel crashes we can continue the work from this data and save time.

Working on POS Cash Balance

EDA and Preprocessing on POS_CASH_BALANCE

To remove irrelevant data using the correlation criterion and domain knowledge.

Merging of POS CASH with Application Train on SK_ID_CURR

Working on Credit Card Balance

Dropping columns with no or close to no significance with respect to the target.

EDA and Preprocessing of Credit Card Balance

Grouping by SK_ID_CURR and SK_ID_PREV by mean so that we retain as much information as possible while merging or joining.
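
The grouping step can be sketched with pandas (toy values; the real table has many more columns). Each (client, previous loan) pair is collapsed to one row by averaging the numeric columns:

```python
import pandas as pd

# Toy credit_card_balance-style table: several monthly rows per previous loan
ccb = pd.DataFrame({"SK_ID_CURR": [100, 100, 100, 200],
                    "SK_ID_PREV": [1, 1, 2, 3],
                    "AMT_BALANCE": [500.0, 700.0, 300.0, 900.0]})

# Collapse to one row per (client, previous loan) by averaging numeric columns
ccb_agg = ccb.groupby(["SK_ID_CURR", "SK_ID_PREV"], as_index=False).mean()
```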

We will work on Installments Payments first and then join everything with Application train after joining installments payments and Credit Card Balance.

Working on Installments Payments

EDA and Preprocessing on Installments Payments

Grouping by SK_ID_CURR and SK_ID_PREV by mean so that we retain as much information as possible while merging or joining.

Joining Credit Card Balance and Installments Payments on SK_ID_PREV and SK_ID_CURR: grouping both tables by these two ids by mean, and then inner joining on SK_ID_PREV only.

EDA and Preprocessing on the merged file (credit card balance and installment payments)

We do this even though each individual table was already processed before merging, because there might be some columns that are now insignificant or highly correlated with each other.

Working on Previous Application Data

EDA and Preprocessing of Previous Application data

removing columns with high correlation and containing large percentage of NULL values

Grouping by SK_ID_CURR and SK_ID_PREV by mean so that we retain as much information as possible while merging or joining.

Merging of Previous Application and previously merged dataframe (Credit card balance + Installments Payments )

EDA and Preprocessing of Merged three Datasets (Previous Application + Credit card balance + Installments Payments)

We do this even though each individual table was already processed before merging, because there might be some columns that are now insignificant or highly correlated with each other.

df_prev_ccbalinst: Merged DataFrame(Previous Application + Credit card balance + Installments Payments)

Merging of df_PREV_CCBALINST with the already merged Application Train + Bureau data

Left joining on SK_ID_CURR

EDA and Preprocessing of Final Merged Data

We do this even though each individual table was already processed before merging, because there might be some columns that are now insignificant or highly correlated with each other.

Feature Engineering

Feature engineering is an important part of machine learning: we try to modify or create (i.e., engineer) new features from our existing dataset that might be meaningful in predicting the TARGET.

Correlation of features with the target variable in order to engineer new features

Expert knowledge features

Often, experts have domain knowledge about which combinations of existing features have strong explanatory/predictive power. In this case we construct the following features:

Feature 1

df_final_merged['DAYS_EMPLOYED_PCT'] = df_final_merged['DAYS_EMPLOYED'] / df_final_merged['DAYS_BIRTH']

Feature 2

df_final_merged['CREDIT_INCOME_PCT'] = df_final_merged['AMT_CREDIT'] / df_final_merged['AMT_INCOME_TOTAL']

Feature 3

df_final_merged['ANNUITY_INCOME_PCT'] = df_final_merged['AMT_ANNUITY_x'] / df_final_merged['AMT_INCOME_TOTAL']

Feature 4

df_final_merged['CREDIT_TERM_PCT'] = df_final_merged['AMT_ANNUITY_x'] / df_final_merged['AMT_CREDIT']

Feature 5

df_final_merged['AMT_BALANCE_PCT'] = df_final_merged['AMT_BALANCE'] / df_final_merged['DAYS_CREDIT']

Feature 6

df_final_merged['AVG_INCOME_EXT_PCT'] = (df_final_merged['EXT_SOURCE_1'] + df_final_merged['EXT_SOURCE_2'] + df_final_merged['EXT_SOURCE_3']) / 3

Feature 7

df_final_merged['AVG_TOTALINCOME_PCT'] = df_final_merged['AMT_INCOME_TOTAL']/df_final_merged['AVG_INCOME_EXT_PCT']
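
Features 1-4 above can be reproduced on a toy one-row frame (values purely illustrative; note the DAYS_* columns are negative in the raw data, so their ratio comes out positive):

```python
import pandas as pd

# Toy merged frame; columns mirror those used in Features 1-4 above
df_final_merged = pd.DataFrame({
    "DAYS_EMPLOYED": [-2000.0], "DAYS_BIRTH": [-10000.0],
    "AMT_CREDIT": [500000.0], "AMT_INCOME_TOTAL": [100000.0],
    "AMT_ANNUITY_x": [25000.0],
})

# Ratio features: employment length vs age, credit and annuity vs income, credit term
df_final_merged["DAYS_EMPLOYED_PCT"] = df_final_merged["DAYS_EMPLOYED"] / df_final_merged["DAYS_BIRTH"]
df_final_merged["CREDIT_INCOME_PCT"] = df_final_merged["AMT_CREDIT"] / df_final_merged["AMT_INCOME_TOTAL"]
df_final_merged["ANNUITY_INCOME_PCT"] = df_final_merged["AMT_ANNUITY_x"] / df_final_merged["AMT_INCOME_TOTAL"]
df_final_merged["CREDIT_TERM_PCT"] = df_final_merged["AMT_ANNUITY_x"] / df_final_merged["AMT_CREDIT"]
```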

Hyper-Parameter Tuning

Separating the dataframe into features and target variable

Splitting the Data into Train and Test Data

Separating numerical and categorical variables to fit in the pipeline

Build processing pipelines

In this part of the project the focus is on constructing the pipeline. Since the data has both numerical and categorical features, we create two pipelines (one for each category of data) because they require different transformations. The two pipelines are then unified into one full pipeline that transforms the whole dataset. After this we perform hyperparameter tuning using GridSearchCV, trying different parameter settings when training the models. We used the Logistic Regression and Random Forest algorithms.
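
The tuning step can be sketched with scikit-learn's GridSearchCV; the grid below is a small illustrative example on synthetic data (the project's actual grids and pipeline differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the project X/y come from the full pipeline
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Small illustrative grid over the inverse regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
best_C = search.best_params_["C"]  # best setting by cross-validated ROC-AUC
```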

Modelling, Fitting and Storing Results

Logistic Regression with GridSearchCV

Logistic regression (LR) is a statistical method similar to linear regression, since LR finds an equation that predicts an outcome for a binary variable, Y, from one or more response variables, X. However, unlike linear regression, the response variables can be categorical or continuous, as the model does not strictly require continuous data. To predict group membership, LR uses the log odds ratio rather than probabilities and an iterative maximum-likelihood method rather than least squares to fit the final model. This means the researcher has more freedom when using LR, and the method may be more appropriate for non-normally distributed data or when the samples have unequal covariance matrices. Logistic regression assumes independence among the variables, which is not always met in real datasets.

Prediction probability of Class 0 and Class 1

Validation Accuracy

Log Loss

ROC-AUC Score

ROC Curve

Confusion Matrix

Random Forest Classifier with GridSearchCV

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions by aggregating the predictions of the trees created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Steps 1 & 2 until N trees have been built.

Step-5: For a new data point, take the majority vote of the N trees' predictions.

Validation Accuracy

Log Loss

ROC-AUC Score

ROC Curve

Confusion Matrix

Preparing Test Data

Missing Value Percent in Test Data

Test Data Exploration

Datasets Questions

Kaggle Submission Phase1

pred_proba_df

Kaggle Submission Phase2

Kaggle submission via the command line API

report submission

Click on this link

Kaggle.jpeg

Write Up Phase 1

Phase 1 - Final Project HCDR EDA + Baseline

Home Credit Default Risk Kaggle Competition

PROJECT MEMBERS:

ADITI MULYE - adimulye@iu.edu

KESHAV LIKHAR - klikhar@iu.edu

NIKUNJ MALPANI - nmalpani@iu.edu

PRASHASTI KARLEKAR - prkarl@iu.edu

PROJECT ABSTRACT

The course project is based on the Home Credit Default Risk (HCDR) competition (https://www.kaggle.com/c/home-credit-default-risk/). The objective of the Home Credit Default Risk project is to correctly offer loans to individuals who can pay back and turn away those who cannot. In other words, the goal is to predict whether or not a client will repay a loan. In this phase of the project, we perform analysis and modelling of the Home Credit default risk data currently hosted on Kaggle. The objective of this phase is to perform exploratory data analysis, which includes describing the data, calculating summary statistics to summarize its main characteristics, visualizing the results, finding missing-value counts for all the input variables, etc. This step tells us what the data can tell us before we start modeling, and lets us perform initial investigations to discover patterns, spot irregularities and biases, and check assumptions with the help of summary statistics. The EDA process also involved dropping the columns with the most missing values. Subsequently, we built a correlation matrix to find and remove highly correlated variables which would render our model inefficient. To avoid any inaccuracy, we split the dataframe into train and test subsets. Next, we create pipelines to separately impute numerical and categorical values. We use one-hot encoding to transform the categorical variable values. Finally, we use Logistic Regression and Random Forest to predict our target variable from our input variables.

PROJECT DESCRIPTION

  1. Data description - The home-credit data is currently available on Kaggle for predicting whether or not a client will repay a loan or have difficulty, which is a critical business need. In this dataset we have a total of 122 columns, many of which have missing values; there are missing values even in the top 10 rows, so we'll have to figure out a way to deal with this! There are 7 different sources of data:

WhatsApp Image 2021-11-16 at 11.27.10 PM.jpeg

Task to be tackled

The task at hand for this phase of the project is to:

Provide diagrams to aid understanding of the workflow

WhatsApp Image 2021-11-16 at 11.13.02 PM.jpeg

EXPLORATORY DATA ANALYSIS

Before deriving insights from the data, we need to know the data which gives a better idea of the problem at hand and the irregularities in the data are exposed with further analysis which need to be corrected before building the model. Here, we look at application_train.csv dataset.

Steps performed in this phase of project:

WhatsApp Image 2021-11-16 at 11.19.32 PM.jpeg

Identify the types of data available - Before proceeding with further analysis of the data, we determine the datatypes of our variables. This uncovers issues that could otherwise be encountered later, for example variables whose datatypes seem correct but essentially need to be represented as another datatype.

datasets['application_train'].info(verbose=True)

info() prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

WhatsApp Image 2021-11-16 at 11.19.31 PM.jpeg

Evaluate basic statistical information about the data - To get an idea of the typical values in the data set, we measure the central tendencies of our dataset.

datasets["application_train"].describe(include='all')

describe() prints descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution.

WhatsApp Image 2021-11-16 at 11.13.13 PM (1).jpeg

Finding number of numerical variables and categorical variables It is important to deal with numerical and categorical variables differently. Categorical features have a lot to say about the dataset thus they should be converted to numerical to make it into a machine-readable format.

WhatsApp Image 2021-11-16 at 11.30.14 PM.jpeg

Check and determine missing data
Before we can use data with missing data fields, we need to transform these fields to be used for analysis and modelling. It is very important to handle missing data either by deleting or through imputation (handling the missing values with some estimation).

WhatsApp Image 2021-11-16 at 11.14.55 PM.jpeg

We determine the number of missing entries in every column and sort the values in descending order by missing count to determine which features have the most missing values. We keep only the columns which have more than 50 percent non-missing values. Next, we impute the remaining missing data with the median for numerical variables and the most frequent value for categorical variables.

WhatsApp Image 2021-11-16 at 11.14.55 PM (1).jpeg

Visualization through correlation matrix
A Correlation Matrix is used to examine the relationship between multiple variables at the same time. When we do this calculation we get a table containing the correlation coefficients between each variable and the others. These coefficients show us both the strength of the relationship and its direction (positive or negative). In Python, a correlation matrix can be created using packages such as Pandas and NumPy. Correlation matrices are useful when we have a big data set and intend to explore patterns, and they can feed into other statistical methods: for instance, as input data for exploratory factor analysis, confirmatory factor analysis or structural equation models, or as a diagnostic when checking assumptions for, e.g., regression analysis.

WhatsApp Image 2021-11-16 at 11.14.55 PM (2).jpeg

Above, we have plotted the correlation of all the features with the target variable. As can be seen, approximately 30% of the features are not correlated with the target variable at all. We can drop them, as we do not believe they will influence our predictions of the target variable. Next, we drop the irrelevant variables which are highly correlated with one another.

WhatsApp Image 2021-11-16 at 11.14.55 PM (3).jpeg

Conclusions of the Correlation Matrix - Since many independent (feature) variables are related to each other, we deal with them by dropping the repetitive features, i.e. deleting all but one of each group of columns that are highly correlated with each other. Thus, as we can see above, we have removed all the columns that are related to each other, leaving just one from each group.

VISUAL EXPLORATORY DATA ANALYSIS

  1. In this phase, we derive meaningful insights from the data by answering reasonable questions raised in the earlier steps. How is the distribution of target labels? Did most people repay on time? According to the description of the data, "1 indicates a client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample; 0 indicates all other cases".

WhatsApp Image 2021-11-16 at 11.14.55 PM (4).jpeg

  1. How is the distribution of Applicant’s Family Members Count?

WhatsApp Image 2021-11-16 at 11.19.31 PM (1).jpeg

Above, we have plotted a graph describing the loan applicants' family member counts. As we can see, hardly any applicants have more than 5 family members.

Who is the highest borrower ?

WhatsApp Image 2021-11-16 at 11.14.55 PM (5).jpeg

As we can see Females are the highest borrowers.

PAIR BASED VISUALIZATION

WhatsApp Image 2021-11-16 at 11.45.38 PM.jpeg

Pairplot showing correlation with target variable for highly correlated variable with target.

MODELING PIPELINES

In this section of the project, the following steps have been performed. Standardizing the data - the numerical variables have been standardized. Handling missing values - following the good practice of imputing missing values, we impute the numerical as well as the categorical variables, with all the work performed in pipelines: the numerical variables are imputed with the median and the categorical variables with the most frequent value.

Handling categorical variables - We have used one of the most common and efficient ways to handle this transformation: ONE-HOT ENCODING. After the transformation, each column of the resulting data set corresponds to one unique value of an original feature. We apply one-hot encoding to the data set so that the resulting sets are suitable for use in machine learning algorithms.
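
One-hot encoding can be sketched in one line with pandas (the pipelines can achieve the same effect with a scikit-learn encoder); the column name below is an illustrative one from the dataset:

```python
import pandas as pd

df = pd.DataFrame({"NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"]})

# Each unique category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["NAME_CONTRACT_TYPE"])
```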

WhatsApp Image 2021-11-16 at 11.14.56 PM.jpeg

Model training: Now that we have explored the data, cleaned it, preprocessed it and added new features, we can start the modeling part of the project by applying machine learning algorithms. In this section, we have a baseline logistic regression model and a random forest model. In the end, we compare the performance of the models with the baseline models.

Baseline Logistic Regression

Logistic Regression is a powerful algorithm for classification problems that fit models for categorical data, especially for binary classification problems. Since our target (dependent) variable is categorical, using logistic regression can directly predict the probability that a customer is creditworthy (able to meet a financial obligation in a timely manner) or not, using a number of predictors.

Loss function used (data loss and regularization parts) in latex -

1234.jpeg
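
The referenced image is not reproduced here; for a binary target y_i in {0, 1}, the standard logistic-regression objective, combining the cross-entropy data loss with an L2 regularization term, is:

```latex
J(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\Big] + \frac{\lambda}{2}\lVert \mathbf{w} \rVert_2^2,
\qquad \hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)
```

Here the first term is the data loss, the second is the regularization part, and sigma is the logistic (sigmoid) function.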

Accuracy and AUC/ROC - Accuracy represents the number of correctly classified data instances over the total number of data instances. Below are the accuracy and the AUC/ROC of our logistic regression model.

WhatsApp Image 2021-11-16 at 11.19.31 PM (2).jpeg

Baseline Random Forest

Random Forest is one of the most popular algorithms; it assembles a large number of decision trees from the training dataset. Each decision tree makes a class prediction, the votes from these decision trees are collected, and the class with the most votes is taken as the final class. Random forest can outperform linear models because it can capture non-linear relationships between the target and the features.

WhatsApp Image 2021-11-16 at 11.19.31 PM (3).jpeg

Loss function used (data loss and regularization parts) in latex -

3243423.jpeg

Accuracy and AUC/ROC - Accuracy represents the number of correctly classified data instances over the total number of data instances. An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds; it plots two parameters: the True Positive Rate and the False Positive Rate. AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

Below are the accuracy and the AUC/ROC of our random forest model.

WhatsApp Image 2021-11-16 at 11.19.31 PM (4).jpeg

Number of experiments conducted - We have selected 2 models for prediction: Logistic Regression and Random Forest Classifier.

Write Up Phase 2

FP Phase 2 - Final Project HCDR - feature engineering + hyperparameter tuning

Home Credit Default Risk Kaggle Competition

PROJECT MEMBERS:

ADITI MULYE - adimulye@iu.edu

KESHAV LIKHAR - klikhar@iu.edu

NIKUNJ MALPANI - nmalpani@iu.edu

PRASHASTI KARLEKAR - prkarl@iu.edu

p1.jpeg

PROJECT ABSTRACT

In this phase of the project, our aim is to build on our observations from Phase 1 and improve the overall accuracy of our model. The model built in Phase 1 resulted in an accuracy of ~73% on Kaggle, and considering the pitfalls of not incorporating additional techniques, we attempt to improve the quality of our dataset, and consequently our model, by performing feature engineering, merging all the datasets available in the Home Credit Default Risk data, and performing hyper-parameter tuning using GridSearchCV. The first part of Phase 2 deals with merging all the datasets available to us on the basis of a primary key (or combination of keys). To check on the quality of our dataset, we perform exploratory data analysis such as heatmaps, correlation matrices and missing-value visualizations on each dataset to better understand its important features, removing insignificant features on the basis of missing-value counts and high correlation. The second part of Phase 2 uses the merged dataset to build additional features which we consider can improve the predictive power of our model. These domain-knowledge features prove successful in increasing the quality of the model, as they are highly correlated with the target variable. The third part of Phase 2 revolves around building pipelines, tuning the hyperparameters with GridSearchCV, and incorporating the most significant ones in our model. We then predict on our final dataset using the Logistic Regression and Random Forest algorithms, thereby improving the validation accuracy to 92.1% (from 91.85% before) and the Kaggle score to ~74%.

PROJECT DESCRIPTION

  1. Data description: the Home Credit data, available on Kaggle, is used to predict whether or not a client will repay a loan or have difficulty doing so, which is a critical business need. The main application table has 122 columns, many of which contain missing values; we can already see many missing values in the top 10 rows alone, so we will have to figure out a way to deal with them. The data comes from 7 different sources:

p2131.jpeg

Task to be tackled

The tasks to be tackled in this phase of the project are summarized in the flowchart below.

flowchart.jpeg

Feature Engineering

The better the features we prepare and choose from the dataset, the better the results, and this is why we perform feature engineering. We need features that describe the structures inherent in the data. The main goal of feature engineering is to get the best results from our algorithms.

The following is the initial feature engineering performed on the dataset in Phase 1:

In this phase, in order to improve the accuracy of our model and to utilize all the features and data from all the datasets, we first merge the datasets together.

Let's look at each dataset individually and merge them. The engineered features are explained in more detail below in the notebook.

Using Bureau Dataset

This dataset contains all clients' previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date. One of the interesting features is CREDIT_ACTIVE, which records the status of each credit.

p2.jpeg

Merging of Bureau with Bureau Balance

We merge the Bureau and Bureau Balance datasets on SK_ID_BUREAU.
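In pandas, this merge can be sketched as follows. The dataframes here are tiny illustrative stand-ins for the real tables; only the key column SK_ID_BUREAU, the other column names from the HCDR schema, and the join itself come from the writeup:

```python
import pandas as pd

# Toy stand-ins for the real bureau and bureau_balance tables
# (column names follow the HCDR schema; the values are illustrative).
bureau = pd.DataFrame({
    "SK_ID_BUREAU": [100, 101],
    "SK_ID_CURR": [1, 1],
    "CREDIT_ACTIVE": ["Closed", "Active"],
})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [100, 100, 101],
    "MONTHS_BALANCE": [-1, -2, -1],
    "STATUS": ["C", "0", "1"],
})

# Left join keeps every bureau credit, even one with no monthly balance rows,
# and yields one row per (credit, month) pair.
df_bureau = bureau.merge(bureau_balance, on="SK_ID_BUREAU", how="left")
```

A left join is the safe default here: an inner join would silently drop bureau credits that have no entries in bureau_balance.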

p3.jpeg

Here, we create a helper dataframe containing only TARGET and SK_ID_CURR so that we can check each merged bureau column's correlation with the target and drop irrelevant columns. Since this dataset has not been preprocessed, we perform EDA on the merged data, dropping columns with many missing values and columns that are highly correlated with each other.

p4.jpeg

From the missingno plot, we see that AMT_CREDIT_SUM_LIMIT has many missing values, so we impute this column by replacing null values with 0.
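A minimal sketch of this missing-value handling in pandas. The 60% drop threshold and the MOSTLY_MISSING column are illustrative assumptions; only the zero-imputation of AMT_CREDIT_SUM_LIMIT comes from the text:

```python
import numpy as np
import pandas as pd

# Toy data: one HCDR column imputed with 0, one hypothetical
# mostly-missing column that gets dropped.
df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_CREDIT_SUM_LIMIT": [np.nan, 5000.0, np.nan, 0.0],
    "MOSTLY_MISSING": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns where more than (say) 60% of the values are missing...
missing_pct = df.isna().mean()
df = df.drop(columns=missing_pct[missing_pct > 0.6].index)

# ...and impute the remaining sparse column with 0, as described above.
df["AMT_CREDIT_SUM_LIMIT"] = df["AMT_CREDIT_SUM_LIMIT"].fillna(0)
```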

The merged bureau and bureau balance dataset should contain only the features that matter for the final dataset, so we perform feature selection to keep only the most important ones.

p5.jpeg

p6.jpeg

Merging of processed Bureau Data with Application Train Data

We merge our df_bureau_final with application_train on SK_ID_CURR.

p7.jpeg

Working on POS Cash Balance

This dataset consists of monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.

We perform EDA on this dataset by finding out the missing values count and removing the irrelevant features as derived from the correlation matrix.

p8.jpeg

Merging of POS CASH with Application Train on SK_ID_CURR

We merge our df_POS with application_train on SK_ID_CURR.

p9.jpeg

Using Credit Card Balance

This dataset contains the monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.

We perform EDA on this dataset by finding the missing-value counts and removing the irrelevant features which have no or insignificant correlation with the target variable.

p10.jpeg

We also group by SK_ID_CURR and SK_ID_PREV, taking the mean, so that we retain as much information as possible when merging or joining this dataset.
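This aggregation can be sketched in pandas as follows (toy data; the real table has many monthly rows and many more columns):

```python
import pandas as pd

# Toy monthly credit card balance rows for two previous loans.
cc = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_PREV": [10, 10, 20],
    "AMT_BALANCE": [100.0, 300.0, 50.0],
})

# Collapse the monthly rows to one row per (client, previous loan),
# averaging the numeric columns.
cc_agg = cc.groupby(["SK_ID_CURR", "SK_ID_PREV"], as_index=False).mean()
```

Averaging is a lossy but simple aggregation; min/max/sum aggregates are a common alternative when more signal is needed.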

Next, we will work on Installments Payments first and then join everything with Application train after joining installments payments and Credit Card Balance.

Using Installments Payments

This dataset contains the data of payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

We perform EDA on this dataset by finding the missing-value counts and removing the irrelevant features which have no or insignificant correlation with the target variable.

p11.jpeg

We group by SK_ID_CURR and SK_ID_PREV, taking the mean, so that we retain as much information as possible when merging or joining the datasets.

p12.jpeg

Joining Credit Card Balance and Installments Payments: both tables are grouped by SK_ID_PREV and SK_ID_CURR (taking the mean) and then inner joined on SK_ID_PREV only
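A sketch of this join in pandas, assuming the two tables have already been aggregated as described above (the dataframes and values are illustrative):

```python
import pandas as pd

# Aggregated (one row per previous loan) toy versions of the two tables.
cc_agg = pd.DataFrame({
    "SK_ID_PREV": [10, 20],
    "SK_ID_CURR": [1, 2],
    "AMT_BALANCE": [200.0, 50.0],
})
inst_agg = pd.DataFrame({
    "SK_ID_PREV": [10, 30],
    "SK_ID_CURR": [1, 3],
    "AMT_PAYMENT": [80.0, 40.0],
})

# Inner join on SK_ID_PREV keeps only previous loans present in both
# tables; dropping the duplicate SK_ID_CURR avoids a _x/_y column pair.
merged = cc_agg.merge(inst_agg.drop(columns="SK_ID_CURR"),
                      on="SK_ID_PREV", how="inner")
```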

We then perform EDA and preprocess the merged credit card balance and installments payments data. We do this even though each individual table has already been processed, because some columns may now be insignificant or highly correlated with each other.

p13.jpeg

Using Previous Application Data

This dataset contains the data of previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.

We remove columns with high correlation or a large percentage of NULL values while performing EDA and preprocessing this dataset.

p14.jpeg

We group by SK_ID_CURR and SK_ID_PREV, taking the mean, so that we retain as much information as possible when merging or joining.

Joining Previous Application and previously merged dataframe (Credit card balance + Installments Payments)

After merging all the preprocessed datasets, we perform EDA and preprocess this merged dataset as well, because some columns may now be insignificant or highly correlated with each other.

p15.jpeg

Merging all the pre-processed merged datasets

We merge df_PREV_CCBALINST with the previously merged application_train and bureau data, left joining the datasets on SK_ID_CURR.

We then perform EDA on the final merged dataset, because some columns may now be insignificant or highly correlated with each other.

p16.jpeg

Additional features added to the final dataset

While performing EDA and with our domain knowledge, we found out some existing features whose combination can have strong explanatory/predictive power.

Here is the description of the additional features :

Feature 1 : 'DAYS_EMPLOYED_PCT' :

Percentage of days employed: how long a person has been employed, as a percentage of their age, is a strong predictor of their ability to keep paying off loans.

We use this feature because the total time a person has been employed affects their credit-paying ability, and the resulting feature is highly correlated with the target variable.

p17.jpeg

Feature 2 : 'CREDIT_INCOME_PCT' :

Available credit as a percentage of income: if a person has a very large amount of credit available relative to their income, this can impact their ability to pay off loans. A person with a very high credit limit relative to their earnings will often be unable to pay it back.

p18.jpeg

Feature 3 : 'ANNUITY_INCOME_PCT' :

Annuity as a percentage of income: an annuity is a more stable source of income, so the higher this ratio, the less likely a client is to default.

We include this feature for that reason, and because it is highly correlated with the target variable.

p19.jpeg

Feature 4 : 'CREDIT_TERM_PCT' :

Annuity as a percentage of available credit: an annuity is a more stable source of income, so if it is a high percentage of the available credit, the person is more likely to be able to pay off their debts.

Feature 5 : 'AMT_BALANCE_PCT' :

Amount balance as a percentage: the remaining balance in the account is an important feature, as it says a lot about a person's ability to pay back their debt. We include it because of its correlation with the target feature.

Feature 6 : 'AVG_INCOME_EXT_PCT' :

Average of the three external income sources: since these three features are the most highly correlated with the target among all features, we average them into a new feature that captures how they affect the prediction of the target variable. This feature has a correlation of about 13%.

Feature 7: 'AVG_TOTALINCOME_PCT' :

Total income average as a percentage: this is the strongest of the seven newly engineered features. It expresses total income relative to the person's external income sources. We engineered it using Feature 6 above, and it is about 22% negatively correlated with the target variable.
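The clearly specified ratios above (Features 1-4 and 6) can be sketched in pandas as follows. The toy values are illustrative, and the exact formulas for AMT_BALANCE_PCT and AVG_TOTALINCOME_PCT are not spelled out in the text, so they are omitted here:

```python
import pandas as pd

# Toy application row; column names follow the HCDR schema.
df = pd.DataFrame({
    "DAYS_BIRTH": [-12000.0],
    "DAYS_EMPLOYED": [-2000.0],
    "AMT_CREDIT": [300000.0],
    "AMT_INCOME_TOTAL": [150000.0],
    "AMT_ANNUITY": [15000.0],
    "EXT_SOURCE_1": [0.5],
    "EXT_SOURCE_2": [0.6],
    "EXT_SOURCE_3": [0.7],
})

# Features 1-4: ratio features as described above.
df["DAYS_EMPLOYED_PCT"] = df["DAYS_EMPLOYED"] / df["DAYS_BIRTH"]
df["CREDIT_INCOME_PCT"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
df["ANNUITY_INCOME_PCT"] = df["AMT_ANNUITY"] / df["AMT_INCOME_TOTAL"]
df["CREDIT_TERM_PCT"] = df["AMT_ANNUITY"] / df["AMT_CREDIT"]

# Feature 6: average of the three external source scores.
df["AVG_INCOME_EXT_PCT"] = df[["EXT_SOURCE_1", "EXT_SOURCE_2",
                               "EXT_SOURCE_3"]].mean(axis=1)
```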

After adding these additional features, we analyze them and find that AVG_TOTALINCOME_PCT and DAYS_EMPLOYED_PCT rank highest as measured by correlation with TARGET.

p20.jpeg

corrrrr.jpeg

Hyperparameter Tuning

Hyperparameters are the settings that govern the training process itself. Hyperparameter tuning works by running multiple trials within a single training job. In this part of the phase, we fit models and store the results with GridSearchCV.
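A minimal sketch of such a GridSearchCV run, using synthetic data and an illustrative parameter grid (the notebook's actual grids may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed HCDR feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each grid cell is evaluated with 3-fold cross-validation on ROC-AUC.
param_grid = {"C": [0.01, 0.1, 1.0]}
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, cv=3, scoring="roc_auc")
grid.fit(X, y)

# The best parameters and cross-validated score are then available:
best_params, best_score = grid.best_params_, grid.best_score_
```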

Modeling Pipelines

Modeling Pipelines (HCDR)

In this section of the project, the following steps have been performed, all within pipelines. Standardizing the data: the numerical variables have been standardized. Handling missing values: following the good practice of imputing missing values in the dataset, the numerical variables have been imputed with the median and the categorical variables with the most frequent value.

Handling categorical variables

We use one of the most common and efficient ways to handle this transformation: one-hot encoding. After the transformation, each column of the resulting dataset corresponds to one unique value of an original feature. We implement one-hot encoding so that the resulting sets are suitable for use in machine learning algorithms.
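The preprocessing steps above can be sketched as a scikit-learn ColumnTransformer. The column names and toy values are illustrative; the imputation strategies and one-hot encoding match the description:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and one categorical HCDR-style column.
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [150000.0, np.nan, 90000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", np.nan],
})
num_cols = ["AMT_INCOME_TOTAL"]
cat_cols = ["NAME_CONTRACT_TYPE"]

preprocess = ColumnTransformer([
    # Numeric: median imputation followed by standardization.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    # Categorical: most-frequent imputation followed by one-hot encoding.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     cat_cols),
])

Xt = preprocess.fit_transform(df)
```

In a full pipeline, this transformer is followed by the estimator, so the whole chain can be fit and tuned as a single object.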

p21.jpeg

Model training

Now that we have explored the data, preprocessed it, and performed feature engineering along with hyper parameter tuning, we can start the modeling part of the project by applying Machine Learning algorithms. In this section, we have our upgraded logistic regression model and a random forest model with GridSearchCV. In the end, we will be comparing the performance of the models.

Logistic Regression

Logistic Regression is a powerful algorithm for classification problems, fitting models to categorical outcomes and especially suited to binary classification. Since our target (dependent) variable is categorical, logistic regression can directly predict the probability that a customer is creditworthy (able to meet a financial obligation in a timely manner), using a number of predictors.

Logistic Regression with GridSearchCV

p22.jpeg

p23.jpeg

Validation Accuracy

The validation accuracy of the Logistic Regression model was 0.9198721421605808

Log Loss

The loss function used combines a data-loss term and a regularization term; for logistic regression this is the regularized log loss:

$$J(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\right] + \frac{\lambda}{2}\lVert \mathbf{w} \rVert_2^2$$

where $\hat{p}_i$ is the predicted probability of default for client $i$ and $\lambda$ controls the regularization strength.

The log loss of the model was 2.7675196811129577

ROC-AUC Curve

Accuracy represents the number of correctly classified data instances over the total number of data instances. Below are the accuracy and the ROC/AUC of our logistic regression model.

The ROC-AUC score of the model was 0.738818586860982

ROC-AUC curve (Logistic Regression)

p24.jpeg

Confusion Matrix of Logistic Regression Classifier

Screen Shot 2021-12-07 at 10.03.32 PM.png

Random forest Classifier with GridSearchCV

Random Forest is one of the most popular algorithms; it builds an ensemble of a large number of decision trees from the training dataset. Each decision tree makes a class prediction, the votes are collected, and the class with the most votes becomes the final prediction. Random forests can outperform linear models because they capture non-linear relationships between the features and the target.
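A hedged sketch of the random forest grid search, again on synthetic data with an illustrative grid rather than the notebook's actual one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed HCDR feature matrix.
X, y = make_classification(n_samples=300, n_features=12, random_state=42)

# Two typical random forest hyperparameters, tuned over a small grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=3, scoring="roc_auc")
grid.fit(X, y)

best_params, best_score = grid.best_params_, grid.best_score_
```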

p26.jpeg

Experiment Table with all the details of the experiment:

Screen Shot 2021-12-07 at 10.05.25 PM.png

Validation Accuracy

Accuracy represents the number of correctly classified data instances over the total number of data instances. ROC/AUC: an ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: the true positive rate and the false positive rate. AUC provides an aggregate measure of performance across all possible classification thresholds; one way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
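These metrics can all be computed with scikit-learn. The probabilities below are toy values, not our model's outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             log_loss, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1])            # 1 = default
y_prob = np.array([0.2, 0.4, 0.8, 0.3, 0.1, 0.9])  # predicted P(default)
y_pred = (y_prob >= 0.5).astype(int)             # threshold at 0.5

acc = accuracy_score(y_true, y_pred)   # fraction correctly classified
auc = roc_auc_score(y_true, y_prob)    # threshold-free ranking quality
ll = log_loss(y_true, y_prob)          # penalizes confident wrong probabilities
cm = confusion_matrix(y_true, y_pred)  # 2x2: rows true, columns predicted
```

Note that accuracy and the confusion matrix depend on the threshold, while ROC-AUC and log loss are computed from the probabilities directly.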

The validation accuracy of Random Forest Model was 0.9209556831726081

Log Loss

The log loss of the model was 2.730093984189766

ROC-AUC Curve

The ROC-AUC score of the model was 0.7241955610236327

ROC-AUC Curve ( Random Forest )

p28.jpeg

Confusion Matrix of Random Forest

p29.jpeg

Conclusion and Future Scope

Initially, we tried merging all the different datasets and created several new columns to understand the relation with target column and gain fruitful insights from the dataset. On analysis, we found out that out of the new columns, a few of them were highly correlated with the target variable, thus helping us improve the model predictions. We also tried using synthetic sampling techniques like SMOTE and ADASYN to check if there was any change in the model performance.

Finally, we created a pipeline to run Logistic Regression and a Random Forest Classifier, using hyperparameter tuning to find the best parameters. We then ran our models with the best parameters, found an increase in model accuracy, and submitted the file on Kaggle.

In Phase II of this project, we built Logistic Regression and Random Forest models, performed hyperparameter tuning using GridSearchCV, and achieved an accuracy of over 92.1% with Random Forest and 92% with Logistic Regression.

The next steps in our project would be to improve our model performance through implementation of Neural Networks.

Screen Shot 2021-12-07 at 10.09.11 PM.png

Kaggle Submission

Below is a screenshot of our best Kaggle submission, showing the details of the submission and not just the score.

Kaggle.jpeg
